Blocking Blog Spam with Language Model Disagreement

Authors

  • Gilad Mishne
  • David Carmel
  • Ronny Lempel
Abstract

We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method requires no training, no hard-coded rule sets, and no knowledge of complete-web connectivity. Preliminary experiments with identification of typical blog spam show promising results.
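
As a rough illustration of what comparing language models can look like, the sketch below builds smoothed unigram models for two texts and measures their Kullback-Leibler divergence. The abstract does not name the comparison measure, the smoothing scheme, or any decision threshold, so the KL divergence, the interpolation weight `lam`, and all function names here are assumptions made for illustration, not the authors' implementation. Pages linked from a comment are left out; under the same assumptions they could be compared with the post in the same way.

```python
import math
from collections import Counter


def unigram_counts(text):
    """Lowercase, whitespace-tokenize, and count words (a deliberately naive tokenizer)."""
    tokens = text.lower().split()
    return Counter(tokens), len(tokens)


def kl_divergence(text_p, text_q, lam=0.9):
    """KL(P || Q) between interpolation-smoothed unigram models of two texts.

    Assumption: each model is mixed with a background model estimated from
    both texts together, so every word in the joint vocabulary has nonzero
    probability. `lam` is a hypothetical smoothing weight.
    """
    p_counts, p_total = unigram_counts(text_p)
    q_counts, q_total = unigram_counts(text_q)
    if p_total == 0 or q_total == 0:
        raise ValueError("both texts must contain at least one token")

    bg_counts = p_counts + q_counts
    bg_total = p_total + q_total

    divergence = 0.0
    for word, bg_count in bg_counts.items():
        p_bg = bg_count / bg_total
        p = lam * p_counts[word] / p_total + (1 - lam) * p_bg
        q = lam * q_counts[word] / q_total + (1 - lam) * p_bg
        divergence += p * math.log(p / q)
    return divergence


# Hypothetical example texts: a post, an on-topic comment, and a spam comment.
post = "how to bake sourdough bread at home with a starter"
on_topic = "my sourdough starter makes great bread at home"
spam = "cheap pills online casino buy now best prices"

on_topic_score = kl_divergence(post, on_topic)
spam_score = kl_divergence(post, spam)
```

In this toy setup the spam comment shares no vocabulary with the post, so its divergence from the post comes out higher than that of the on-topic comment. How such scores would be turned into a spam decision is not stated in the abstract.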

Similar resources

Online Spam Detection in Blogs: A Behavior-based Approach

With the increasing usage of user generated content based social networks, spam content is surging by taking advantage of the convenience of web posting. Modern spammers in social networks insert popular keywords or even copy and paste recent articles on the web with spam links inserted, in order to evade language model based spam detection. In this paper, we first conduct a comprehensive analy...

AIRWeb 2005 Proceedings

We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method requires no training, no hard-coded rule sets, and no knowledge of complete-web connectivity. Preliminary experiments with identification of typical blog spam ...

Library blogs and user participation: a survey about comment spam in library blogs

Purpose The purpose of this research is to identify and describe the impact of comment spam in library blogs. Three research questions guided the study: current level of commenting in library blogs; librarians' perception of comment spam; and techniques used to address the comment spam problem. Design/methodology/approach A quantitative approach is used to investigate research questions. Inform...

Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics Proceedings of the Main Conference

Email is the number one activity that people do on the internet: 74% of internet users check their email on an average day. Email use in offices has more than doubled since 2000, and is now over 8 hours a week. There are many great NLP problems for email, like automatic clustering and foldering, search, prioritization, automatically finding keywords within messages, finding addresses, and summa...

Detecting Blog Spams using the Vocabulary Size of All Substrings in Their Copies

This paper addresses the problem of detecting blog spams, which are unsolicited messages on blog sites, among blog entries. Unlike a spam mail, a typical blog spam is produced to increase the PageRank for the spammer’s Web sites, and so many copies of the blog spam are necessary and all of them contain URLs of the sites. Therefore the number of the copies, we call it the frequency, seems to be ...

Publication date: 2005